Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
https://arxiv.org/abs/2306.05685
https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
IMO: With a strong LLM such as GPT-4 as the judge, LLM-as-a-judge can reportedly match human evaluation, reaching the same level of agreement.
From the Abstract: evaluating LLM-based chat assistants
we explore using strong LLMs as judges to evaluate these models on more open-ended questions
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: #MT-bench , a multi-turn question set; and #Chatbot_Arena , a crowdsourced battle platform. (Abstract)
Here "agreement" means how well the judge's verdicts correspond to human preferences.
Evaluated in Section 4
Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
we study the LLM-as-a-judge approach by comparing it to the gold standard of human evaluation (1 Introduction)
Table 1
MT-Bench (2.2)
We create MT-bench, a benchmark consisting of 80 high-quality multi-turn questions.
Chatbot Arena (2.3)
Chatbot Arena, a crowdsourcing benchmark platform featuring anonymous battles.
users can interact with two anonymous models simultaneously, posing the same question to both.
Users then vote on which response is better
C.2 Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings
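A minimal sketch of how Elo ratings can be updated from pairwise battle outcomes. The K-factor of 32 and the initial rating of 1000 are generic defaults, not necessarily the paper's settings (see Appendix C.2 for the actual computation), and the battle data below is purely illustrative:

```python
# Minimal online Elo update from pairwise battle outcomes.
# Assumptions: K=32 and an initial rating of 1000 are common defaults,
# not necessarily the values used in the paper (see Appendix C.2).
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, outcome: float, k: float = 32.0) -> tuple[float, float]:
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * ((1.0 - outcome) - (1.0 - e_a))

# Replay a small list of (model_a, model_b, outcome) battles (illustrative data).
battles = [("vicuna-13b", "alpaca-13b", 1.0), ("alpaca-13b", "vicuna-13b", 0.0)]
ratings: dict[str, float] = {}
for a, b, outcome in battles:
    r_a, r_b = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    ratings[a], ratings[b] = update_elo(r_a, r_b, outcome)
print(ratings)
```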
Section 3 is still on my to-read pile (looks interesting)
4 Agreement Evaluation
We randomly sample 3K single-turn votes from 30K arena data (4.1)
Figure 3
Bias toward rating models from the same family higher (e.g., the Claude plot in (b))
We propose 3 LLM-as-a-judge variations (3.1)
Prompt templates are in Appendix A
Pairwise comparison (Figure 5)
Single answer grading (Figure 6)
Reference-guided grading (Figure 8)
(The prompt figures run up to Figure 10; a rough sketch of the pairwise variant follows below.)
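A rough sketch of calling the pairwise-comparison judge. The exact template and output format are given in Appendix A (Figure 5); the wording here is a paraphrase, and `call_llm` is a hypothetical callable that sends a prompt to the judge model (e.g. GPT-4):

```python
# Rough sketch of a pairwise-comparison judge call (the exact template is in
# Appendix A / Figure 5; the wording here is a paraphrase, and `call_llm` is a
# hypothetical callable that queries the judge model).
import re
from typing import Callable

PAIRWISE_TEMPLATE = """Please act as an impartial judge and evaluate the quality of the
responses provided by two AI assistants to the user question displayed below.
Do not let the order or the length of the responses influence your decision.
Output your final verdict as "[[A]]" if assistant A is better, "[[B]]" if
assistant B is better, or "[[C]]" for a tie.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str,
               call_llm: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'tie' according to the judge's verdict."""
    reply = call_llm(PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    match = re.search(r"\[\[(A|B|C)\]\]", reply)
    return {"A": "A", "B": "B", "C": "tie"}[match.group(1)] if match else "error"
```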
We present a few methods to address position bias and the limited grading ability for math questions (3.4)
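One of the Section 3.4 mitigations is swapping the answer order and only keeping a verdict that holds under both orders; otherwise the result is treated as a tie. A minimal sketch, assuming a `judge` callable like the `judge_pair` sketch above:

```python
# Sketch of the position-swap consistency check (Section 3.4): query the judge
# twice with the answer order flipped and only accept a win that survives the
# swap; inconsistent verdicts fall back to a tie. `judge` is any callable
# (question, answer_a, answer_b) -> 'A' | 'B' | 'tie' (e.g. the sketch above).
def judge_with_swap(judge, question: str, answer_a: str, answer_b: str) -> str:
    first = judge(question, answer_a, answer_b)    # A shown first
    second = judge(question, answer_b, answer_a)   # B shown first
    # Map the swapped-order verdict back to the original A/B labels.
    second_mapped = {"A": "B", "B": "A", "tie": "tie"}.get(second, "error")
    if first == second_mapped and first in ("A", "B"):
        return first
    return "tie"  # ties or inconsistent verdicts count as a tie
```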
Biases are reported in the Appendix B Case Study
Appendix D Additional Experimental Results
⚠️ Per what was observed regarding the effect of instructions when building LLMs and in the evaluations by humans and GPT-4:
Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
According to the Appendix (D.3?), tie votes are apparently excluded (per Sekine-san)
"If humans split 50% A / 50% B and GPT-4 picks A, that counts as 50% agreement" (human-majority)
For example, if there are an equal number of “A” and “B” human votes for a question, and GPT4 votes “A”, the agreement is counted as 1/2 on this question.
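A worked sketch of the per-question agreement count described in the quote: if the judge's verdict is among k tied human-majority labels it contributes 1/k, so a 50/50 A/B split with a GPT-4 vote of "A" counts as 1/2. This is one reading of the example above, not necessarily the paper's exact formula (see Section 4.2 / Appendix D.3):

```python
# Per-question agreement with fractional credit for tied human-majority labels,
# matching the quoted example (equal "A"/"B" votes + judge "A" -> 1/2).
# This is an interpretation of the example, not the paper's verbatim formula.
from collections import Counter

def question_agreement(human_votes: list[str], judge_vote: str) -> float:
    counts = Counter(human_votes)
    top = max(counts.values())
    majority = {label for label, c in counts.items() if c == top}
    return 1.0 / len(majority) if judge_vote in majority else 0.0

def overall_agreement(per_question) -> float:
    """per_question: iterable of (human_votes, judge_vote) pairs."""
    scores = [question_agreement(votes, judge) for votes, judge in per_question]
    return sum(scores) / len(scores)

# Example matching the quote: equal "A"/"B" human votes, GPT-4 votes "A" -> 0.5
print(question_agreement(["A", "B"], "A"))  # 0.5
```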